Introduction

Executive Summary

This project uses marketing campaign data from a Portuguese banking institution to predict whether a client will subscribe to a term deposit. Before classification, we apply an unsupervised learning technique, association rule mining, to explore the relationships among the variables. We then use five classification methods to build different trees and different kinds of decision boundaries. In the end, we select the random forest model by examining the ROC curves and confusion matrices. Although this model is difficult to visualize, it shows that age, day, pdays, and balance are important predictors in addition to poutcome and duration. Thus, we should collect data on these variables more carefully in order to improve accuracy.

The Problem

The bank data come from the direct marketing campaigns of a Portuguese banking institution. We use this information to explore which kinds of clients are likely to subscribe to a term deposit.

Data Sources

Created by: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL) @ 2012

The Data

The original dataset has 4521 observations and 17 variables. For this analysis, we create one additional variable, a 0/1 indicator of y, for building the ROC curves.

VARIABLES TO PREDICT WITH

  • age: the age of each client (numeric)
  • job: type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”)
  • marital: marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
  • education: (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
  • default: has credit in default? (binary: “yes”,“no”)
  • balance: average yearly balance, in euros (numeric)
  • housing: has housing loan? (binary: “yes”,“no”)
  • loan: has personal loan? (binary: “yes”,“no”)
  • contact: contact communication type (categorical: “unknown”, “telephone”, “cellular”)
  • day: last contact day of the month (numeric)
  • month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
  • duration: last contact duration, in seconds (numeric)
  • campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  • previous: number of contacts performed before this campaign and for this client (numeric)
  • poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

VARIABLES WE WANT TO PREDICT

  • y: has the client subscribed to a term deposit (binary: “yes”, “no”)

Data Exploration

Examine the Data

Now we can examine the data:
      age                 job          marital         education   
 Min.   :19.00   management :969   divorced: 528   primary  : 678  
 1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306  
 Median :39.00   technician :768   single  :1196   tertiary :1350  
 Mean   :41.17   admin.     :478                   unknown  : 187  
 3rd Qu.:49.00   services   :417                                   
 Max.   :87.00   retired    :230                                   
                 (Other)    :713                                   
 default       balance      housing     loan           contact    
 no :4445   Min.   :-3313   no :1962   no :3830   cellular :2896  
 yes:  76   1st Qu.:   69   yes:2559   yes: 691   telephone: 301  
            Median :  444                         unknown  :1324  
            Mean   : 1423                                         
            3rd Qu.: 1480                                         
            Max.   :71188                                         
                                                                  
      day            month         duration       campaign     
 Min.   : 1.00   may    :1398   Min.   :   4   Min.   : 1.000  
 1st Qu.: 9.00   jul    : 706   1st Qu.: 104   1st Qu.: 1.000  
 Median :16.00   aug    : 633   Median : 185   Median : 2.000  
 Mean   :15.92   jun    : 531   Mean   : 264   Mean   : 2.794  
 3rd Qu.:21.00   nov    : 389   3rd Qu.: 329   3rd Qu.: 3.000  
 Max.   :31.00   apr    : 293   Max.   :3025   Max.   :50.000  
                 (Other): 571                                  
     pdays           previous          poutcome      y       
 Min.   : -1.00   Min.   : 0.0000   failure: 490   no :4000  
 1st Qu.: -1.00   1st Qu.: 0.0000   other  : 197   yes: 521  
 Median : -1.00   Median : 0.0000   success: 129             
 Mean   : 39.77   Mean   : 0.5426   unknown:3705             
 3rd Qu.: -1.00   3rd Qu.: 0.0000                            
 Max.   :871.00   Max.   :25.0000                            
                                                             
     y_ind       
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :0.0000  
 Mean   :0.1152  
 3rd Qu.:0.0000  
 Max.   :1.0000  
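One number worth pulling out of this summary is the class balance. A quick base-R check (counts taken from the y column above) confirms the data are quite imbalanced, which matters when judging accuracy later:

```r
# Counts from the summary output: 4000 "no" vs 521 "yes"
n_no  <- 4000
n_yes <- 521
n_yes / (n_no + n_yes)   # ~0.1152, matching the mean of y_ind above
```

So a model that always predicts "no" would already reach roughly 88.5% accuracy, which is why we also track sensitivity and specificity below.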
                 

Data Visualization

Categorical Variables

  • We can see the distribution of clients who subscribed to a term deposit across job, marital, education, default, etc.

[Tabbed bar charts: y (yes=1/no=0) overall, and y by job, marital, education, default, housing, loan, contact, month, and poutcome]

Numerical Variables

  • We can see the different distributions of the two outcomes in this campaign and observe the differences between them.

[Tabbed density plots by outcome for age, balance, day, duration, campaign, and pdays]

Association Rules

Association Rules

In order to observe which categorical variables are more important for prediction, we select all categorical variables to build association rules, using only y=yes as the right-hand side.

create association rules

We sort the rules by lift.
    lhs                   rhs        support confidence     lift count
[1] {housing=no,                                                      
     loan=no,                                                         
     poutcome=success} => {y=yes} 0.01371378  0.7126437 6.183996    62
[2] {default=no,                                                      
     housing=no,                                                      
     loan=no,                                                         
     poutcome=success} => {y=yes} 0.01371378  0.7126437 6.183996    62
[3] {housing=no,                                                      
     loan=no,                                                         
     contact=cellular,                                                
     poutcome=success} => {y=yes} 0.01194426  0.7105263 6.165623    54
[4] {default=no,                                                      
     housing=no,                                                      
     loan=no,                                                         
     contact=cellular,                                                
     poutcome=success} => {y=yes} 0.01194426  0.7105263 6.165623    54
[5] {housing=no,                                                      
     poutcome=success} => {y=yes} 0.01393497  0.7078652 6.142531    63
[6] {default=no,                                                      
     housing=no,                                                      
     poutcome=success} => {y=yes} 0.01393497  0.7078652 6.142531    63
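The support, confidence, and lift of the top rule can be reproduced by hand from its count. Note the LHS count of 87 is not printed above; it is our inference from count / confidence:

```r
# Rule [1]: {housing=no, loan=no, poutcome=success} => {y=yes}
n      <- 4521   # total observations
n_yes  <- 521    # clients with y = "yes"
n_lhs  <- 87     # clients matching the LHS (inferred: 62 / 0.7126437)
count  <- 62     # clients matching both LHS and RHS

support    <- count / n                  # ~0.0137
confidence <- count / n_lhs              # ~0.7126
lift       <- confidence / (n_yes / n)   # ~6.18: "yes" is ~6x likelier under the LHS
```

A lift above 1 means the rule's left-hand side raises the probability of subscribing well above the 11.5% base rate.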

Visualization of Association Rules

This plot tells us the important categorical variables and levels. We can see that poutcome=success, loan=no, contact=cellular, marital=married, and housing=no improve the probability of subscribing to a term deposit.

[Parallel-coordinates plot of the association rules]

Classification Analysis

Predict if the term deposit will be subscribed

We predict whether the client will subscribe to a term deposit. For this analysis we compare five classification methods: decision tree, random forest, boosting, SVC, and SVM.

Decision Tree

We use cross-validation to find the optimal parameter values for the decision tree.

Using CV to fit a rpart tree

When cp = 0.02, we achieve the highest accuracy.
CART 

2713 samples
  17 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 2441, 2442, 2442, 2442, 2442, 2442, ... 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa    
  0.01597444  0.8935985  0.3338781
  0.02076677  0.8940914  0.3401855
  0.04579340  0.8865965  0.1661075

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.02076677.

Variable Importance

Visualizing the tree

We can see this tree uses only duration and poutcome. When the contact duration was over 986 seconds, clients were more likely to subscribe. Also, when the duration was less than 645 seconds but the previous campaign's outcome was a success, clients were more willing to subscribe.

[Decision tree diagram]

ROC Curve for Decision Tree

Confusion Matrix

     pred_tree
        no  yes
  no  1557   43
  yes  151   57
 Accuracy 
0.8926991 
Sensitivity 
       0.57 
Specificity 
  0.9115925 
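These numbers can be reproduced from the printed matrix. One caveat, which is our reading of caret's conventions rather than something stated above: the matrix was built as table(truth, prediction), but confusionMatrix() assumes the first dimension is the prediction, so the reported "Sensitivity" appears to be computed down the predicted-"yes" column:

```r
# Cells read from the matrix above (rows: true y, columns: prediction)
tn <- 1557; fp <- 43; fn <- 151; tp <- 57

(tn + tp) / (tn + fp + fn + tp)  # 0.8926991, the reported accuracy
tp / (tp + fp)                   # 0.57, the reported "Sensitivity" (really TP/(TP+FP))
tp / (tp + fn)                   # ~0.274, sensitivity in the usual truth-row sense
tn / (tn + fn)                   # 0.9115925, the reported "Specificity"
```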

Analysis Summary

With this model, we can see that duration and poutcome are the two variables most useful for prediction, but the ROC curve shows its ability to find ‘yes’ clients is poor.

Random Forest

We use cross-validation to find the optimal parameter values for the random forest.

Using CV to find out appropriate mtry

We find that when mtry = 22, the accuracy on the training data is highest.
Random Forest 

2713 samples
  17 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 2442, 2442, 2442, 2442, 2442, 2441, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa     
   2    0.8863523  0.04083843
  22    0.8904082  0.37200646
  42    0.8879468  0.36981509

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 22.

Variable Importance

ROC Curve for Random Forest

Confusion Matrix

     pred_tree
        no  yes
  no  1544   56
  yes  125   83
 Accuracy 
0.8998894 
Sensitivity 
  0.5971223 
Specificity 
  0.9251049 

Analysis Summary

With this model, we can find more of the ‘yes’ clients, and more variables help decide whether a client is willing to subscribe.

Boosting

We use cross-validation to look for the optimal parameters of the boosted model.

Using CV to find out appropriate parameter

A gradient boosted model with bernoulli loss function.
1800 iterations were performed.
There were 42 predictors of which 39 had non-zero influence.
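The jump from 17 columns to 42 predictors comes from dummy coding: besides y and y_ind, the data have 7 numeric inputs and 9 factors, and each factor contributes (levels − 1) indicator columns under treatment contrasts. A quick check, with level counts taken from the variable list above:

```r
numeric_vars  <- 7   # age, balance, day, duration, campaign, pdays, previous
factor_levels <- c(job = 12, marital = 3, education = 4, default = 2,
                   housing = 2, loan = 2, contact = 3, month = 12, poutcome = 4)
numeric_vars + sum(factor_levels - 1)   # 42 model-matrix predictors
```

The same expansion explains the mtry grid of 2, 22, and 42 used for the random forest.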

Variable Importance

ROC Curve for Boosting

Confusion Matrix

We can see the accuracy, sensitivity, and specificity are slightly higher than those of the decision tree.
     pred_tree
        no  yes
  no  1548   52
  yes  131   77
 Accuracy 
0.8987832 
Sensitivity 
  0.5968992 
Specificity 
  0.9219774 

SVC

We use an SVC to see whether a linear boundary can separate the data well.

Using CV to find out appropriate tuning parameter, C

We try values of C from 1 to 3.
Support Vector Machines with Linear Kernel 

2713 samples
  17 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 2442, 2442, 2441, 2442, 2441, 2442, ... 
Resampling results across tuning parameters:

  C  Accuracy   Kappa    
  1  0.8928655  0.2262859
  2  0.8928655  0.2262859
  3  0.8928655  0.2262859

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 1.

Creating model with C = 1

 Setting default kernel parameters  
Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 1 

Linear (vanilla) kernel function. 

Number of Support Vectors : 672 

Objective Function Value : -582 
Training error : 0.106893 
Probability model included. 

ROC Curve for SVC

Confusion Matrix

Through the confusion matrix, we can see this model does not perform as well as the boosted model.
     pred_tree
        no  yes
  no  1581   19
  yes  175   33
 Accuracy 
0.8926991 
Sensitivity 
  0.6346154 
Specificity 
  0.9003417 

Analysis Summary

It is clear that a linear boundary does not fit the data well.

SV Machine

As the linear boundary doesn't work well, we want to try a non-linear boundary.

Using CV to find out optimal parameter values

We try ranges of parameter values for degree, scale, and C to find the optimal combination.
Support Vector Machines with Polynomial Kernel 

2713 samples
  17 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 2442, 2441, 2442, 2442, 2442, 2441, ... 
Resampling results across tuning parameters:

  degree  scale  C  Accuracy   Kappa    
  2       0.001  1  0.8931088  0.2188177
  2       0.001  2  0.8931088  0.2188177
  2       0.001  3  0.8931088  0.2188177
  2       0.010  1  0.8926131  0.2573664
  2       0.010  2  0.8937220  0.2890638
  2       0.010  3  0.8913854  0.2912875
  2       0.100  1  0.8810660  0.3630610
  2       0.100  2  0.8736940  0.3519029
  2       0.100  3  0.8703784  0.3482293
  3       0.001  1  0.8931088  0.2188177
  3       0.001  2  0.8931088  0.2188177
  3       0.001  3  0.8932318  0.2201688
  3       0.010  1  0.8908934  0.2892829
  3       0.010  2  0.8896611  0.3088018
  3       0.010  3  0.8907704  0.3342062
  3       0.100  1  0.8660729  0.3366554
  3       0.100  2  0.8622626  0.3244073
  3       0.100  3  0.8612804  0.3240682
  4       0.001  1  0.8931088  0.2188177
  4       0.001  2  0.8934773  0.2231195
  4       0.001  3  0.8932313  0.2211710
  4       0.010  1  0.8917494  0.3216049
  4       0.010  2  0.8911398  0.3487105
  4       0.010  3  0.8901599  0.3586792
  4       0.100  1  0.8642320  0.3325546
  4       0.100  2  0.8596863  0.3204228
  4       0.100  3  0.8579639  0.3151879

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 2, scale = 0.01 and C
 = 2.

Creating model with degree = 4, scale = 0.01, C = 1

Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 1 

Polynomial kernel function. 
 Hyperparameters : degree =  4  scale =  0.01  offset =  1 

Number of Support Vectors : 660 

Objective Function Value : -585.024 
Training error : 0.100258 
Probability model included. 

ROC Curve for SVM

Confusion Matrix

We can see the non-linear boundary is not good either.
     pred_tree
        no  yes
  no  1581   19
  yes  175   33
 Accuracy 
0.8926991 
Sensitivity 
  0.6346154 
Specificity 
  0.9003417 

Analysis Summary

We are unlikely to use this model for prediction.

SV Machine Radial Basis

We use cross-validation to find the optimal parameter values, as before.

Using CV to find out optimal parameter values

We try ranges of parameter values for sigma and C to find the optimal combination.
Support Vector Machines with Radial Basis Function Kernel 

2713 samples
  17 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 2442, 2442, 2441, 2442, 2442, 2441, ... 
Resampling results across tuning parameters:

  sigma  C  Accuracy   Kappa    
  0.001  1  0.8931042  0.2169900
  0.001  2  0.8931042  0.2169900
  0.001  3  0.8931042  0.2169900
  0.010  1  0.8929876  0.2515855
  0.010  2  0.8927389  0.2897767
  0.010  3  0.8922446  0.2863099
  0.100  1  0.8854904  0.1683339
  0.100  2  0.8825442  0.1450180
  0.100  3  0.8829137  0.1406028

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.001 and C = 1.

Creating model with sigma = 0.001, C = 1

Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 1 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  0.001 

Number of Support Vectors : 649 

Objective Function Value : -625.2013 
Training error : 0.11537 
Probability model included. 

ROC Curve for SVM Radial Basis

Confusion Matrix

In this confusion matrix, we can see that every observation in the test data is predicted as ‘no’.
     pred_tree
        no  yes
  no  1600    0
  yes  208    0
 Accuracy 
0.8849558 
Sensitivity 
         NA 
Specificity 
  0.8849558 
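Because the model never predicts ‘yes’, its accuracy is exactly the majority-class rate and its sensitivity is an undefined 0/0, which caret reports as NA. A quick check from the matrix above:

```r
# All 1808 test observations are predicted "no"
tn <- 1600; fn <- 208
(tn + 0) / (tn + fn)   # 0.8849558: accuracy equals the share of "no" in the test set
0 / (0 + 0)            # NaN: no positive predictions, so sensitivity is undefined
```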

Analysis Summary

Although the AUC of this model is larger than the AUCs of the SVC and the degree-4 SVM, it cannot find the ‘yes’ clients at all, so we are unlikely to use it.

Conclusion

Summary

In conclusion, we can see that random forest and boosting predict best, so we select the final model by examining the ROC curves and confusion matrices of all models. In the end, we choose the random forest model, as its ROC curve is closer to the top-left corner than the boosted model's.

Combining the ROC curves of all models

Creating a summary of all models showing accuracy, sensitivity, and specificity.

Model                        Accuracy  Sensitivity  Specificity
decision tree                  0.89       0.57         0.91
random forest                  0.90       0.60         0.92
boosting                       0.90       0.60         0.93
SVC C=1                        0.89       0.57         0.88
SVM_4d scale=0.01/C=1          0.89       0.63         0.90
SVM_rad sigma=0.001/C=1        0.88       0.00         0.88
---
title: "Classification Problem"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: scroll
    source_code: embed
    theme: united
---

```{r setup, include=FALSE,warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output

library(grid)
library(arules) #for association rules
library(arulesViz) #creates nicer plots to show rules
library(GGally)
library(broom) #for tidy() function
library(rpart) #Partition package to create trees
library(rpart.plot) #creates nicer tree plots
library(caret) #Classification And REgression Training - will use to create confusion matrix
library(plotROC) #needed for ggplot geom_roc()
library(flexdashboard)
library(tree)
library(tidyverse)
library(janitor)
library(kernlab)
library(gbm)
theme_set(theme_bw())
```

```{r load_data}
#Load the data
df <- read.table('bank.csv', header = T, sep =";")
df <- df %>%
  mutate(job = gsub('blue-collar', 'blue_collar', job),      # reference the piped column, not df$job,
         job = gsub('self-employed', 'self_employed', job),  # so the first substitution is kept
         job = as.factor(job),
         y_ind = ifelse(y == 'yes', 1, 0))                   # 0/1 indicator of y for ROC curves

```

Introduction {data-orientation=rows}
=======================================================================

Row
-----------------------------------------------------------------------
### Executive Summary
This project uses marketing campaign data from a Portuguese banking institution to predict whether a client will subscribe to a term deposit. Before classification, we apply an unsupervised learning technique, association rule mining, to explore the relationships among the variables. We then use five classification methods to build different trees and different kinds of decision boundaries. In the end, we select the random forest model by examining the ROC curves and confusion matrices. Although this model is difficult to visualize, it shows that age, day, pdays, and balance are important predictors in addition to poutcome and duration. Thus, we should collect data on these variables more carefully in order to improve accuracy.

Row
-----------------------------------------------------------------------
### The Problem
The bank data come from the direct marketing campaigns of a Portuguese banking institution. We use this information to explore which kinds of clients are likely to subscribe to a term deposit.

Row
-----------------------------------------------------------------------
### Data Sources
Created by: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL) @ 2012

Row
-----------------------------------------------------------------------
### The Data
The original dataset has 4521 observations and 17 variables. For this analysis, we create one additional variable, a 0/1 indicator of y, for building the ROC curves.


Row
-----------------------------------------------------------------------
VARIABLES TO PREDICT WITH

* *age*: the age of each client (numeric)
* *job*: type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar", "self-employed", "retired", "technician", "services")
* *marital*: marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed) 
* *education*: (categorical: "unknown","secondary","primary","tertiary") 
* *default*: has credit in default? (binary: "yes","no")
* *balance*: average yearly balance, in euros (numeric) 
* *housing*: has housing loan? (binary: "yes","no") 
* *loan*: has personal loan? (binary: "yes","no")
* *contact*: contact communication type (categorical: "unknown","telephone","cellular")
* *day*: last contact day of the month (numeric)
* *month*: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
* *duration*: last contact duration, in seconds (numeric)
* *campaign*: number of contacts performed during this campaign and for this client (numeric, includes last contact)
* *pdays*: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
* *previous*: number of contacts performed before this campaign and for this client (numeric)
* *poutcome*: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

VARIABLES WE WANT TO PREDICT

* *y*: has the client subscribed to a term deposit (binary: "yes","no")

Data Exploration {data-orientation=rows}
=======================================================================

Column {.sidebar data-width=500}
-------------------------------------

### Data Overview 
We can see most clients did not subscribe to a term deposit after the marketing campaigns. The minimum client age was 19 and the maximum was 87; most clients were between 33 and 49 years old. Management was the most common job, followed by blue-collar work. More than half of the clients were married and had finished secondary education. Just a few clients had credit in default. Balance ranged from -3313 to 71188 euros, with a mean of 1423 and a median of 444. Most clients had a housing loan but no personal loan. The main contact method was cellular, though many contact types were unknown. Clients were most likely to be called in the middle of the month, especially in May. The last contact duration averaged 264 seconds, with a minimum of 4 and a maximum of 3025. The mean and median number of contacts were 2.794 and 2, respectively. Most clients had not been contacted in a previous campaign, and the most common previous-campaign outcomes were unknown and failure.

Column {data-width=500}
-----------------------------------------------------------------------
### Examine the Data
Now we can examine the data:
```{r, cache=TRUE}
#View data
summary(df)
```


Data Visualization {data-orientation=rows}
=======================================================================

Row {data-height=100}
-----------------------------------------------------------------------
### Categorical Variables
* We can see the distribution of clients who subscribed to a term deposit across job, marital, education, default, etc.

Row {.tabset}
-----------------------------------------------------------------------
### y yes(1)/no(0)
```{r, cache=TRUE}
g <- ggplot(df, aes(y))
g + geom_bar(width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Bar Chart on Categorical Variable", 
       subtitle="target variable y")
```


### job

```{r, cache=TRUE}
g <- ggplot(df, aes(job))
g + geom_bar(aes(fill = y), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Bar Chart on Categorical Variable", 
       subtitle="job")
```


### marital

```{r, cache=TRUE}
g <- ggplot(df, aes(marital))
g + geom_bar(aes(fill = y), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Bar Chart on Categorical Variable", 
       subtitle="marital")
```


### education

```{r, cache=TRUE}
g <- ggplot(df, aes(education))
g + geom_bar(aes(fill = y), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Bar Chart on Categorical Variable", 
       subtitle="education")
```


### default

```{r, cache=TRUE}
g <- ggplot(df, aes(default))
g + geom_bar(aes(fill = y), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Bar Chart on Categorical Variable", 
       subtitle="default")
```


### housing

```{r, cache=TRUE}
g <- ggplot(df, aes(housing))
g + geom_bar(aes(fill = y), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Bar Chart on Categorical Variable", 
       subtitle="housing")
```


### loan

```{r, cache=TRUE}
g <- ggplot(df, aes(loan))
g + geom_bar(aes(fill = y), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Bar Chart on Categorical Variable", 
       subtitle="loan")
```


### contact

```{r, cache=TRUE}
g <- ggplot(df, aes(contact))
g + geom_bar(aes(fill = y), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Bar Chart on Categorical Variable", 
       subtitle="contact")
```


### month

```{r, cache=TRUE}
g <- ggplot(df, aes(month))
g + geom_bar(aes(fill = y), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Bar Chart on Categorical Variable", 
       subtitle="month")
```


### poutcome

```{r, cache=TRUE}
g <- ggplot(df, aes(poutcome))
g + geom_bar(aes(fill = y), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Bar Chart on Categorical Variable", 
       subtitle="poutcome")
```


Row {data-height=100}
-----------------------------------------------------------------------

### Numerical Variables

* We can see the different distributions of the two outcomes in this campaign and observe the differences between them.


Row {.tabset}
-----------------------------------------------------------------------
### age

```{r, cache=TRUE}
g <- ggplot(df, aes(age))
g + geom_density(aes(fill=factor(y)), alpha=0.8) + 
    labs(title="Density Plot on Numerical Variable", 
         subtitle="age by number of term deposits subscribed",
         x="age",
         fill="y")
```


### balance

```{r, cache=TRUE}
g <- ggplot(df, aes(balance))
g + geom_density(aes(fill=factor(y)), alpha=0.8) + 
    labs(title="Density Plot on Numerical Variable", 
         subtitle="balance by number of term deposits subscribed",
         x="balance",
         fill="y")
```


### day

```{r, cache=TRUE}
g <- ggplot(df, aes(day))
g + geom_density(aes(fill=factor(y)), alpha=0.8) + 
    labs(title="Density Plot on Numerical Variable", 
         subtitle="day by number of term deposits subscribed",
         x="day",
         fill="y")
```


### duration

```{r, cache=TRUE}
g <- ggplot(df, aes(duration))
g + geom_density(aes(fill=factor(y)), alpha=0.8) + 
    labs(title="Density Plot on Numerical Variable", 
         subtitle="duration by number of term deposits subscribed",
         x="duration",
         fill="y")
```


### campaign

```{r, cache=TRUE}
g <- ggplot(df, aes(campaign))
g + geom_density(aes(fill=factor(y)), alpha=0.8) + 
    labs(title="Density Plot on Numerical Variable", 
         subtitle="campaign by number of term deposits subscribed",
         x="campaign",
         fill="y")
```


### pdays

```{r, cache=TRUE}
g <- ggplot(df, aes(pdays))
g + geom_density(aes(fill=factor(y)), alpha=0.8) + 
    labs(title="Density Plot on Numerical Variable", 
         subtitle="pdays by number of term deposits subscribed",
         x="pdays",
         fill="y")
```


### previous

```{r, cache=TRUE}
g <- ggplot(df, aes(previous))
g + geom_density(aes(fill=factor(y)), alpha=0.8) + 
    labs(title="Density Plot on Numerical Variable", 
         subtitle="previous by number of term deposits subscribed",
         x="previous",
         fill="y")
```


Association Rules {data-orientation=rows}
=======================================================================
### Association Rules
In order to observe which categorical variables are more important for prediction, we select all categorical variables to build association rules, using only y=yes as the right-hand side.

Row {data-height=800}
-----------------------------------------------------------------------

### create association rules
We sort the rules by lift.
```{r, cache=TRUE}
CATdf <- df %>%
  select_if(is.factor)
apriori_rules <- apriori(CATdf, parameter = list(supp = 0.01, conf = 0.5),  appearance = list(rhs = c('y=yes')), control = list (verbose=F))
apriori_rules_sort <- sort(apriori_rules, by='lift')
inspect(head(apriori_rules_sort))
```

### Visualization of Association Rules
This plot tells us the important categorical variables and levels. We can see that poutcome=success, loan=no, contact=cellular, marital=married, and housing=no improve the probability of subscribing to a term deposit.

### Plot
```{r, cache=TRUE}
plot(x = apriori_rules, method = 'paracoord')
```

Classification Analysis {data-orientation=rows}
=======================================================================
### Predict if the term deposit will be subscribed
We predict whether the client will subscribe to a term deposit. For this analysis we compare five classification methods: decision tree, random forest, boosting, SVC, and SVM.


```{r, cache=TRUE}
set.seed(123)
# Stratified 60/40 split: createDataPartition samples within each level of y,
# preserving the ~11.5% "yes" rate in both subsets
inTraining <- createDataPartition(df$y, 
                                  p = .6,
                                  list = FALSE)
df_train <- df[inTraining, ]
df_test  <- df[-inTraining, ]
```

Row
-----------------------------------------------------------------------
### Decision Tree
We use cross-validation to find the optimal parameter values for the decision tree.

Row
-----------------------------------------------------------------------
### Using CV to fit a rpart tree
When cp = 0.02, we achieve the highest accuracy.
```{r, cache=TRUE}
fit_control <- trainControl(method = 'repeatedcv',
                            number = 10,
                            repeats = 3)

cv_tree <- train(y ~ .-y_ind,
                 data = df_train,
                 method = 'rpart',
                 trControl = fit_control)
cv_tree
```

Row {data-height=800}
-----------------------------------------------------------------------
### Variable Importance
```{r, cache=TRUE}
imp <- varImp(cv_tree)$importance
rn <- row.names(imp)
imp_df <- tibble(variable = rn, 
                 importance = imp$Overall) %>%
  arrange(importance) %>%
  mutate(variable = factor(variable, variable))

p <- ggplot(data = imp_df,
            aes(variable, importance))

p + geom_col() +
  coord_flip() +
  ggtitle('Decision Tree')
```

### Visualizing the tree
We can see this tree uses only duration and poutcome. When the contact duration was over 986 seconds, clients were more likely to subscribe. Also, when the duration was less than 645 seconds but the previous campaign's outcome was a success, clients were more willing to subscribe.

### Tree
```{r, cache=TRUE}
rpart.plot(cv_tree$finalModel, type = 4)
```


Row
-----------------------------------------------------------------------
### ROC Curve for Decision Tree
```{r, cache=TRUE}
fits_all <- mutate(df_test,
                   dt_pprobs = predict(cv_tree, newdata = df_test, type = 'prob')[, 2])

p <- ggplot(data = fits_all,
            aes(d = y_ind, m = dt_pprobs)) +
  geom_roc(n.cuts = 0, col = 'navy') +
  style_roc()
p + annotate('text', x = .75, y = .25,
             label = paste('AUC = ', round(calc_auc(p)$AUC, 3)))
```


### Confusion Matrix
```{r, cache=TRUE}
pred_tree <- predict(cv_tree, newdata = df_test, type = "raw")
cm <- confusionMatrix(table(df_test$y, pred_tree),positive = "yes")
cm$table
cm$overall["Accuracy"]
cm$byClass["Sensitivity"]
cm$byClass["Specificity"]

```
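Note that when `confusionMatrix()` is given a table, it treats the table's rows as predictions; since `table(df_test$y, pred_tree)` puts the true labels in the rows, the explicit `data`/`reference` form is the unambiguous alternative (a sketch):

```{r}
# Explicit argument order: data = predictions, reference = true labels.
cm2 <- confusionMatrix(data = pred_tree,
                       reference = df_test$y,
                       positive = "yes")
cm2$byClass[c("Sensitivity", "Specificity")]
```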

Row
-----------------------------------------------------------------------
### Analysis Summary
With this model, we can see that duration and poutcome are the two variables useful for prediction, but the ROC curve shows that the model's ability to find the 'yes' cases is poor.


Row 
-----------------------------------------------------------------------
### Random Forest
We use cross-validation to find the optimal parameter values for the random forest.

Row 
-----------------------------------------------------------------------
### Using CV to find an appropriate mtry
Cross-validation shows that mtry = 22 gives the highest accuracy on the training data.
```{r, cache=TRUE}
cv_rftree <- train(y ~ .-y_ind,
                 data = df_train,
                 method = 'rf',
                 ntree = 50,
                 trControl = fit_control)
cv_rftree
```
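By default caret tries only a few `mtry` values; an explicit grid could be searched instead (a sketch with hypothetical candidate values, not part of the original analysis):

```{r}
# Hypothetical explicit grid of mtry candidates.
rf_grid <- expand.grid(mtry = c(2, 8, 15, 22))
cv_rftree_grid <- train(y ~ . - y_ind,
                        data = df_train,
                        method = 'rf',
                        ntree = 50,
                        trControl = fit_control,
                        tuneGrid = rf_grid)
```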

### Variable Importance

```{r, cache=TRUE}
imp <- varImp(cv_rftree)$importance
rn <- row.names(imp)
imp_df <- tibble(variable = rn, 
                 importance = imp$Overall) %>%
  arrange(importance) %>%  # ascending, so coord_flip() puts the largest at the top
  mutate(variable = factor(variable, variable))

p <- ggplot(data = imp_df,
            aes(variable, importance))

p + geom_col() +
  coord_flip() +
  ggtitle('Random Forest')
```


Row
-----------------------------------------------------------------------
### ROC Curve for Random Forest

```{r, cache=TRUE}
fits_all <- mutate(fits_all,
                   rf_pprobs = predict(cv_rftree, newdata = df_test, type = 'prob')[, 2])

p <- ggplot(data = fits_all,
            aes(d = y_ind, m = rf_pprobs)) +
  geom_roc(n.cuts = 0, col = 'navy') +
  style_roc()
p + annotate('text', x = .75, y = .25,
             label = paste('AUC = ', round(calc_auc(p)$AUC, 3)))
```


### Confusion Matrix

```{r, cache=TRUE}
pred_tree <- predict(cv_rftree, newdata = df_test, type = "raw")
cm <- confusionMatrix(table(df_test$y, pred_tree),positive = "yes")
cm$table
cm$overall["Accuracy"]
cm$byClass["Sensitivity"]
cm$byClass["Specificity"]
```


Row
-----------------------------------------------------------------------

### Analysis Summary
With this model, we can identify more of the 'yes' cases, and more variables contribute to deciding whether clients will subscribe.


Row
-----------------------------------------------------------------------
### Boosting
We use cross-validation to look for the optimal parameters for the boosted model.

Row
-----------------------------------------------------------------------
### Using CV to find out appropriate parameter

```{r, cache=TRUE}
set.seed(21)
grid <- expand.grid(interaction.depth = c(1, 2, 3),
                    n.trees = seq(0, 2000, by = 100),
                    shrinkage = c(.01, 0.001),
                    n.minobsinnode = 10)

trainControl <- trainControl(method = "cv", number = 5)

gbm_y <- train(y ~ .-y_ind,
                 data = df_train,
                 distribution = 'bernoulli',
                 method = 'gbm',
                 trControl = trainControl,
                 tuneGrid = grid,
                 verbose = FALSE)
gbm_y$finalModel
```
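The winning tuning combination can be read off directly (a sketch):

```{r}
# Best n.trees, interaction.depth, shrinkage, and n.minobsinnode
# found by the cross-validated grid search.
gbm_y$bestTune
```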

### Variable Importance

```{r, cache=TRUE}
imp <- varImp(gbm_y)$importance
rn <- row.names(imp)
imp_df <- tibble(variable = rn, 
                 importance = imp$Overall) %>%
  arrange(importance) %>%  # ascending, so coord_flip() puts the largest at the top
  mutate(variable = factor(variable, variable))

p <- ggplot(data = imp_df,
            aes(variable, importance))

p + geom_col() +
  coord_flip() +
  ggtitle('Boosting')
```

Row {data-width=600 data-height=800}
-----------------------------------------------------------------------
### ROC Curve for Boosting

```{r, cache=TRUE}
fits_all <- mutate(fits_all,
                   gbm_pprobs = predict(gbm_y, newdata = df_test, type = 'prob')[, 2])

p <- ggplot(data = fits_all,
            aes(d = y_ind, m = gbm_pprobs)) +
  geom_roc(n.cuts = 0, col = 'navy') +
  style_roc()
p + annotate('text', x = .75, y = .25,
             label = paste('AUC = ', round(calc_auc(p)$AUC, 3)))
```


### Confusion Matrix
We can see the accuracy, sensitivity, and specificity are slightly higher than in the previous models.
```{r, cache=TRUE}
pred_tree <- predict(gbm_y, newdata = df_test, type = "raw")
cm <- confusionMatrix(table(df_test$y, pred_tree),positive = "yes") #confusionMatrix is from the caret package
cm$table
cm$overall["Accuracy"]
cm$byClass["Sensitivity"]
cm$byClass["Specificity"]
```

Row
-----------------------------------------------------------------------
### SVC
We use an SVC to see whether a linear boundary can separate the data well.

Row
-----------------------------------------------------------------------
### Using CV to find an appropriate tuning parameter, C
We try values of C from 1 to 3.
```{r, cache=TRUE}
cv_svc <- train(y ~ .-y_ind,
                      data = df_train,
                      method = 'svmLinear',
                      trControl = fit_control,
                      tuneGrid = data.frame(C = 1:3))

cv_svc
```

### Creating model with C = 1
```{r, cache=TRUE}
svc_y <- ksvm(y ~ .-y_ind,
              data = df_train,
              type = 'C-svc',
              kernel = 'vanilladot',
              C = 1,
              prob.model = TRUE)

svc_y
```

Row 
-----------------------------------------------------------------------
### ROC Curve for SVC

```{r, cache=TRUE}
fits_all <- mutate(fits_all,
                   svc_pprobs = predict(svc_y, newdata = df_test, type = 'prob')[, 2])

p <- ggplot(data = fits_all,
            aes(d= y_ind, m = svc_pprobs)) +
  geom_roc(n.cuts = 0, col = 'navy') +
  style_roc()
p + annotate('text', x = .75, y = .25,
             label = paste('AUC = ', round(calc_auc(p)$AUC, 3)))
```

### Confusion Matrix
The confusion matrix shows that this model does not perform as well as the boosted model.
```{r, cache=TRUE}
pred_tree <- predict(svc_y, newdata = df_test, type = "response")
cm <- confusionMatrix(table(df_test$y, pred_tree),positive = "yes") #confusionMatrix is from the caret package
cm$table
cm$overall["Accuracy"]
cm$byClass["Sensitivity"]
cm$byClass["Specificity"]
```

Row
-----------------------------------------------------------------------
### Analysis Summary
It is clear that a linear boundary is not appropriate for these data.


Row
-----------------------------------------------------------------------
### SV Machine
As the linear boundary does not work well, we try a non-linear boundary.

Row
-----------------------------------------------------------------------
### Using CV to find out optimal parameter values
We try ranges of parameter values for degree, scale, and C to find the optimal combination.
```{r, cache=TRUE}
cv_svm_2d <- train(y ~ .-y_ind,
                   data = df_train,
                   method = 'svmPoly',
                   trControl = fit_control,
                   tuneGrid = expand.grid(degree = 2:4,
                                          scale = c(.001, .01, .1),
                                          C = 1:3))
cv_svm_2d
```

### Creating model with degree = 4, scale = 0.01, C = 1
```{r, cache=TRUE}
svm_y <- ksvm(y ~ .-y_ind,
              data = df_train,
              type = 'C-svc',
              kernel = 'polydot',
              kpar = list(degree = 4, scale = 0.01),
              C = 1,
              prob.model = TRUE)

svm_y
```

Row 
-----------------------------------------------------------------------
### ROC Curve for SVM

```{r, cache=TRUE}
fits_all <- mutate(fits_all,
                   svm_pprobs = predict(svm_y, newdata = df_test, type = 'prob')[, 2])

p <- ggplot(data = fits_all,
            aes(d= y_ind, m = svm_pprobs)) +
  geom_roc(n.cuts = 0, col = 'navy') +
  style_roc()
p + annotate('text', x = .75, y = .25,
             label = paste('AUC = ', round(calc_auc(p)$AUC, 3)))
```

### Confusion Matrix
We can see the non-linear boundary does not perform well either.
```{r, cache=TRUE}
pred_tree <- predict(svm_y, newdata = df_test, type = "response")
cm <- confusionMatrix(table(df_test$y, pred_tree),positive = "yes") #confusionMatrix is from the caret package
cm$table
cm$overall["Accuracy"]
cm$byClass["Sensitivity"]
cm$byClass["Specificity"]
```

Row
-----------------------------------------------------------------------

### Analysis Summary
We are unlikely to use this model for prediction.


Row
-----------------------------------------------------------------------
### SV Machine Radial Basis
We use cross-validation to find the optimal parameter values, as before.

Row
-----------------------------------------------------------------------
### Using CV to find out optimal parameter values
We try ranges of parameter values for sigma and C to find the optimal combination.
```{r, cache=TRUE}
set.seed(1223)
cv_svm_rad <- train(y ~ .-y_ind,
                    data = df_train,
                    method = 'svmRadial',
                    trControl = fit_control,
                    prob.model = TRUE,
                    scale = TRUE,
                    tuneGrid = expand.grid(sigma = c(.001, .01, .1),
                                           C = c(1, 2, 3)))
cv_svm_rad
```

### Creating model with sigma = 0.001, C = 1
```{r, cache=TRUE}
svmr_y <- ksvm(y ~ .-y_ind,
              data = df_train,
              type = 'C-svc',
              kernel = 'rbfdot',
              kpar = list(sigma = 0.001),
              C = 1,
              prob.model = TRUE)

svmr_y
```

Row
-----------------------------------------------------------------------
### ROC Curve for SVM Radial Basis

```{r, cache=TRUE}
fits_all <- mutate(fits_all,
                   svmr_pprobs = predict(svmr_y, newdata = df_test, type = 'prob')[, 2])

p <- ggplot(data = fits_all,
            aes(d= y_ind, m = svmr_pprobs)) +
  geom_roc(n.cuts = 0, col = 'navy') +
  style_roc()
p + annotate('text', x = .75, y = .25,
             label = paste('AUC = ', round(calc_auc(p)$AUC, 3)))
```

### Confusion Matrix
In this confusion matrix, we can see every observation in the test data is predicted as 'no'.
```{r, cache=TRUE}
pred_tree <- predict(svmr_y, newdata = df_test, type = 'response')
cm <- confusionMatrix(table(df_test$y, pred_tree), positive = "yes")
cm$table
cm$overall["Accuracy"]
cm$byClass["Sensitivity"]
cm$byClass["Specificity"]
```

Row
-----------------------------------------------------------------------

### Analysis Summary
Although the AUC of this model is larger than the AUCs of the SVC and the degree-4 SVM, it cannot identify the 'yes' cases at all, so we are unlikely to use it.


Conclusion {data-orientation=rows}
=======================================================================

Row 
-----------------------------------------------------------------------
### Summary

In conclusion, random forest and boosting predict best, so we select the optimal model from all candidates by examining the ROC curves and confusion matrices. In the end, we choose the random forest model, as its ROC curve lies closer to the top-left corner than that of the boosted model.

Row
-----------------------------------------------------------------------
### Combining the ROC Curve in all models

```{r, cache=TRUE}
fits_all_along <- fits_all %>%
  dplyr::select('y_ind','dt_pprobs', 'rf_pprobs', 'gbm_pprobs', 'svc_pprobs', 'svm_pprobs', 'svmr_pprobs') %>%
  gather('method', 'prob', -1)
p <- ggplot(data = fits_all_along,
            aes(d = y_ind, m = prob, col = method)) +
  geom_roc(n.cuts = 0) +
  style_roc(xlab = '1 - Specificity', ylab = 'Sensitivity') +
  scale_color_brewer(palette = 'Dark2')
p + ggtitle(paste('AUC = ', paste(round(calc_auc(p)$AUC, 2), collapse = ', ')))
```
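The title above lists AUCs in group order without naming the methods; they can be paired explicitly (a sketch, assuming plotROC assigns groups in the alphabetical order of `method`):

```{r}
# Attach method labels to the AUCs returned by calc_auc().
auc_tbl <- calc_auc(p)
auc_tbl$method <- sort(unique(fits_all_along$method))
auc_tbl
```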

Row 
-----------------------------------------------------------------------
### Creating a summary of all models showing accuracy, sensitivity, and specificity.

```
Model                                    Accuracy  Sensitivity  Specificity
Decision tree                              0.89       0.57         0.91
Random forest                              0.90       0.60         0.92
Boosting                                   0.90       0.60         0.93
SVC (C = 1)                                0.89       0.57         0.88
SVM poly (degree 4, scale 0.01, C = 1)     0.89       0.63         0.90
SVM radial (sigma 0.001, C = 1)            0.88       0.00         0.88